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Abstract 

Balancing selection can maintain different alleles over long evolutionary times. Beyond this direct effect on the molecular targets 
of selection, balancing selection is also expected to increase neutral polymorphism in linked genome regions, in inverse pro- 
portion to their genetic map distances from the selected sites. The genes controlling plant self-incompatibility are subject to one 
of the strongest forms of balancing selection, and they show clear signatures of balancing selection. The genome region 
containing those genes (the S-locus) is generally described as nonrecombining, and the physical size of the region with low 
recombination has recently been established in a few species. However, the size of the region showing the indirect footprints of 
selection due to linkage to the S-locus is only roughly known. Here, we improved estimates of this region by surveying 
synonymous polymorphism and estimating recombination rates at 12 flanking region loci at known physical distances from 
the S-locus region boundary, in two closely related self-incompatible plants Arabidopsis halleri and A. lyrata. In addition to 
studying more loci than previous studies and using known physical distances, we simulated an explicit demographic scenario for 
the divergence between the two species, to evaluate the extent of the genomic region whose diversity departs significantly from 
neutral expectations. At the closest flanking loci, we detected signatures of both recent and ancient indirect effects of selection 
on the S-locus flanking genes, finding ancestral polymorphisms shared by both species, as well as an excess of derived mutations 
private to either species. However, these effects are detected only in a physically small region, suggesting that recombination in 
the flanking regions is sufficient to quickly break up linkage disequilibrium with the S-locus. Our approach may be useful for 
distinguishing cases of ancient versus recently evolved balancing selection in other systems. 

Key words: self-incompatibility, balancing selection, recombination, Arabidopsis, ancestral polymorphism, approximate Bayesian 
computation. 



Introduction 

Among the promises of the ongoing sequencing revolution is 
the possibility of using patterns of molecular variation in nat- 
ural populations to detect the footprints of natural selection. 
Smith and Haigh's (1974) seminal article proposed the "gen- 
etic hitchhiking" concept, where an advantageous mutation 
spreading through a sexual population causes closely linked 
neutral alleles to also increase in frequency, reducing neutral 
and other polymorphism, in a "selective sweep." The size of 
the affected region depends on the strength of selection on 
the target site, the time since the occurrence of the selective 
sweep, and the recombination rate in the region (Kim and 
Stephan 2002). Selection against deleterious mutations also 
eliminates closely linked neutral alleles that are in linkage 
disequilibrium with the deleterious alleles; this hitchhiking 
process is called background selection (Charlesworth et al. 
1993; Loewe and Charlesworth 2007). Through both selective 
sweeps and background selection, natural selection indirectly 



reduces effective population sizes near selected sites (Williford 
and Comeron 2010), increasing the importance of genetic 
drift; in the context of selective sweeps, this is called genetic 
draft (Gillespie 2000). 

Conversely, balancing selection that maintains multiple 
alleles for long evolutionary times (Takahata and Nei 1990; 
Vekemans and Slatkin 1994) can extend the high polymorph- 
ism at the selected locus (Maruyama and Nei 1981) to closely 
linked neutral sites (Strobeck 1983; Charlesworth et al. 1997; 
Meagher and Potts 1997; Hudson and Kaplan 1988; Schierup 
et al. 2000). Balancing selection can be viewed as causing a 
local increase in effective population size, which depends on 
the local recombination rate, the strength of selection, and 
the timescale of maintenance of the polymorphism 
(Charlesworth et al. 1997; Schierup et al. 2000). The situation 
is analogous to neutral differentiation in a subdivided popu- 
lation, with demes replaced by functionally different alleles at 
the selected locus, and migration replaced by recombination 
between the neutral and selected sites (Takahata and Satta 
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1998; Kamau et al. 2007). This process also allows retention of 
trans-specific polymorphisms within these sequences, which 
can cause sequences from different related species to be more 
similar than some pairs of sequences within one of the species 
(loerger et al. 1990; Wu et al. 1998), including in neighboring 
genomic regions, if recombination is infrequent (Charles- 
worth et al. 2006). In systems with long-term maintenance 
of selected polymorphisms and low recombination, theoret- 
ical models show that the signature is expected to cover long 
chromosomal tracts (e.g., sex chromosomes). In recombining 
genomic regions, however, the indirect effect of selection, 
producing longer genealogies, increases the effective recom- 
bination rate, through the greater time for recombination 
events since the common ancestor (Schierup, Mikkelsen, 
et al. 2001). The high diversity due to the local increase in 
effective population size caused by linkage to the selected 
locus is therefore reduced to a narrow region. 

Here, we present empirical data on the self-incompatibility 
(SI) polymorphism of a plant and evaluate the effects of these 
processes on the evolution of regions flanking the incompati- 
bility locus. SI is a genetic system preventing self-fertilization 
in certain hermaphrodite plants. Detailed molecular and gen- 
omic studies show that SI in the family Brassicaceae is con- 
trolled by a single genome region (the S-locus) containing two 
closely linked genes: one (the S-locus cysteine rich or SCR 
protein gene) encodes a pollen surface protein and the 
other (the SRK gene) encodes a receptor kinase expressed 
on the surface of stigma papilla cells. Haplotype-specific rec- 
ognition between the two proteins triggers a downstream 
pathway leading to pollen rejection. The main evolutionary 
force driving allele frequency changes at the S-locus is an 
advantage to rare alleles; this negative frequency dependence 
generates balancing selection (Wright 1939). Arabidopsis hal- 
leri and A. lyrata are closely related plants with this SI system. 
Both species show high allelic and sequence diversity at the 
S-locus (Schierup, Mable, et al. 2001; Castric and Vekemans 
2007), and, based on the sequences, a high proportion of 
alleles are shared (Castric et al. 2008). The S-locus region in 
A. halleri and A. lyrata consists of approximately 70-kb-long 
tract of DNA that includes only two protein coding genes, 
SCR and SRK, and ends with a sharp transition from extremely 
high divergence between different S haplotypes to a region of 
high sequence homology across all haplotypes, suggesting 
complete or nearly complete absence of recombination re- 
stricted to within this region (Cuo et al. 2010; Coubet et al. 
2012). 

For the closest flanking genes on each side of the S-locus 
nonrecombining region, indirect effects of balancing selection 
have been inferred. Kamau and Charlesworth (2005) found 
significantly higher silent nucleotide polymorphism at two 
genes (B80 and B720) than at control genes in A. lyrata, 
using the Hudson-Kreitman-Aguade test to take account 
of possible locus-specific mutation rates; however, their 
sample of alleles came from a geographically limited region. 
Similarly, four genes directly flanking the S-locus (B80, B720, 
ARK3, and B760) showed significantly elevated synonymous 
diversity in A. halleri (Ruggiero et al. 2008). Additionally, 
the flanking genes B80 and ARK3 showed trans-specific 



polymorphisms between A. lyrata and A. thaliana (Charles- 
worth 2006) despite the long divergence time between these 
species (at least 5 My, see Koch and Matschinger [2007]). To 
determine the extent of the genomic region influenced by 
selection on the S-locus, Kamau et al. (2007) estimated poly- 
morphism at four flanking genes more distant from SRK and 
found no evidence for elevated diversity. 

Previous estimates of the extent of the region showing 
increased diversity because of genetic linkage to the S-locus 
in Arabidopsis species (Kamau and Charlesworth 2005; 
Hagenblad et al. 2006; Ruggiero et al. 2008) thus suggest 
that the peak of diversity was sharp. However, the structural 
organization of the S-locus genomic region was not well 
enough understood to allow accurate estimation of the phys- 
ical distance between those genes and the S-locus itself, and it 
was known only that physical sizes differ greatly among 
S-haplotypes in A. lyrata (Kusaba et al. 2001). Further alleles 
were recently described in detail in A. lyrata and A. halleri, 
showing that the distance between the flanking genes B80 
and ARK3 varies from 30 to 110 Kb and that the transition 
between the nonrecombining and recombining regions is 
very sharp (Shiba et al. 2003; Guo et al. 2010; Coubet et al. 
2012). This now allows us, for the first time, to sample mul- 
tiple flanking region genes at a range of known physical 
distances from the boundary of the nonrecombining region. 
We here study eight additional loci around the S-locus, pro- 
viding nucleotide sequence data for a wide geographic sample 
of natural populations of A. halleri and A. lyrata, for a total of 
12 loci. 

Recent population surveys of molecular polymorphism 
in the genomic background of A. lyrata (Ross-Ibarra et al. 
2008) and A. halleri (Roux et al. 2011) also now allow us to 
perform better tests for the effects of selection on the loci 
studied. Because the demographic history (Wright and Gaut 
2005) and locus-specific mutation rates may affect levels and 
patterns of variation, we compared diversity in our samples 
against the null "genomic background" expectation obtained 
by coalescent simulations under an explicit demographic 
model of divergence between A. halleri and A. lyrata that 
was inferred by an Approximate Bayesian Computation 
(ABC) approach based on an extensive sequence data 
set at loci unlinked to the S-locus (Roux et al. 2011). We 
first tested this approach using simulated results for a 
nonselected region linked to a balanced polymorphism and 
show that it readily detects an excess of neutral diversity in 
such regions. Our empirical results for the A. lyrata and 
A. halleri S-locus region, using this method, support the con- 
clusion that the signature of balancing selection, in terms of 
excess diversity, is restricted to a very narrow region, not 
exceeding approximately 10 kb each side of the nonre- 
combining S-locus region boundary. We show that the 
excess of polymorphism in these flanking regions includes 
elevated numbers of both shared variants (indicating 
long-term balancing selection) and exclusive polymorphisms 
(within one species or the other, which will also be expected 
even in cases of balancing selection that have not persisted 
for long). 
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Materials and Methods 

Plant Material 

Using DNA samples from both A. halleri and A. lyrata, we 
estimated species-wide diversity and between-species diver- 
gence at genes located in a 260-kb-wide genomic region cen- 
tered on the S-locus (see DNA Sequencing). For A. halleri, we 
sampled 31 individuals from the following 6 natural popula- 
tions scattered throughout the European distribution of the 
species: Auby, France (N = 6); St Leonhard in Passeier, Italy 
(N = 5); Harz, Germany (N = 5); Stojnci, Slovenia (N = 5); 
Katowice, Poland (N = 5); and Zaton, Czech-Republic 
(N = 5). The same samples were used in preliminary studies 
of four genes flanking the A. halleri S-locus (Ruggiero et al. 
2008) and to infer the history of divergence between A. halleri 
and A. lyrata (Roux et al. 2011). Leaves were collected in the 
field, dried, and DNA was extracted as described in Pauwels 
et al. (2006). For A. lyrata, DNA samples were kindly provided 
by O. Savolainen for four populations: Stubbsand (Iceland) 
(N = 5); Spiterstulen (Norway) (N = 5); Karhumaki (Russia) 
(N = 5); and Plech (Germany) (N - 5). These samples were 
used in this study for sequencing eight additional genes 
(see later), whereas the data set already available for four 
loci was obtained exclusively from one (B160 [Kamau and 
Charlesworth 2005]) or eight Icelandic populations (B80, 
ARK3, and B720 [Hagenblad et al. 2006; Kamau et al. 2007]). 

DNA Sequencing 

Nucleotide sequences for four genes flanking the S-locus (B80 
[PUB8 or At4G27350, based on the A. thaliana genome an- 
notation], ARK3 [At4G21380], B720 [At4G27390], and B760 
[At4g21430]; fig. 1) were already available from the literature 
for both species (Kamau and Charlesworth 2005; Hagenblad 
et al. 2006; Ruggiero et al. 2008). We sequenced large exons in 
eight additional genes at intermediate distances on each side 
of the S-locus to estimate sequence diversity over a wider 
genomic region (fig. 1). Genes containing large exons allow 
an efficient direct sequencing strategy, avoiding insertion/de- 
letion variants, which are common in intron sequences 
(Ross-Ibarra et al. 2008). 

Physical distances between the genes studied were esti- 
mated using the annotation of the A. lyrata genome assembly 
(Hu et al. 2011), which corresponds to the S-locus haplotype 
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Fic. 1. Map of the S-locus genomic region in Arabidopsis lyrata, with 
locations of the 12 flanking genes investigated in this study. The shaded 
box indicates the S-locus nonrecombining region containing the SRK 
and SCR genes involved in the incompatibility reaction. Vertical seg- 
ments indicate the positions of fragments studied (the eight fragments 
newly sequenced in this study are underlined). The two arrows indicate 
the precise limits of the flanking regions investigated. 



S13. According to a recent genomic investigation of the 
S-locus region, we define here the S-locus as the apparently 
nonrecombining region containing SCR and SRK, whose limits 
are, on one side, a position 300 bp upstream of the start 
codon of the B80 flanking gene and on the other side, the 
stop codon of the ARK3 flanking gene (Goubet et al. 2012). 
We used these positions as the starting points to calculate 
physical distances from the S-locus to each flanking region 
gene (fig. 1 and table 1). 

Polymerase chain reaction (PCR) amplifications were car- 
ried out in 50 uJ and conditions were the following: 30 cycles 
of 30 s at 95°C, 45 s at 55°C, and 60 s at 70°C Contaminating 
salts, unincorporated dNTPs, and primers were removed from 
PCR products with the Millipore-Multiscreen purification kit. 
PCR fragments were sequenced using the BigDye Terminator 
Kit 3.1 (Applied Biosystems) and run on an ABI-3130 capillary 
sequencer (Applied Biosystems). All sequences were checked 
manually with SeqScape V2.5 and only included in our ana- 
lysis when confirmed on both strands. The GenBank acces- 
sion numbers for the new sequences obtained in this study 
are JX915910-JX916287. 

Nucleotide Sequence Analyses 

Sequences were aligned using the CLUSTALW program 
(Thompson et al. 1994), and alignments were modified 
manually with the MEGA version 4 program (Tamura et al. 
2007). Reading frames were determined by comparison with 
the A. thaliana orthologs. Haplotype data for the eight genes 
that were directly sequenced were inferred using the PHASE 
algorithm implemented in the DNAsp (v.4.50.3) software 
(Librado and Rozas 2009). The algorithm was run with 
100,000 replicates, a thinning interval value equal to 10 and 
a burn-in period of 10,000. In a few cases when haplotype 
phases could not be determined reliably by the PHASE algo- 
rithm (phase probabilities < 0.90 for at least one polymorphic 
nucleotide site), the PCR products were cloned and 
sequenced. DNAsp (v.4.50.3) was used to calculate the 
Watterson's 0 W and Tajima's n estimators of diversity, using 
synonymous sites. 

To jointly describe diversity patterns in the two species, we 
computed the numbers of synonymous sites in each of seven 
classes of polymorphisms (Ramos-Onsins et al. 2004): 1) fixed 
differences between A. halleri and A. lyrata (Sf.hai and Sf.| yr ), 
that is, polymorphic sites whose derived allele frequency /(/) 
equals 1 in one species and 0 in the other; 2) shared 
polymorphic sites (S s ), where 0 </(/') < 1 in both species; 3) 
exclusive polymorphisms in A. lyrata or A. halleri (S x _h a i and 
S x .| yr ), where /(/) = 0 in A. lyrata or A. halleri, but 0 </(/) < 1 in 
the other species; and 4) the two categories of putative an- 
cestral polymorphisms (S x . ha | f .| yr and S x _| yr f . ha |) defined by 
Ramos-Onsins et al. (2004), respectively, corresponding to 
polymorphic sites in A. lyrata or in A. halleri with 0 </(/') < 1 
but where the derived allele is fixed in the other species 
(/(/) = 1). Sequences from the A. thaliana reference genome 
(Col-0) were used as outgroups to infer ancestral and derived 
states. To reduce inaccuracies caused by homoplasy, we 
excluded positions with more than two segregating alleles 
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in the alignments. The number of these positions is slightly 
increased in the three closest loci around the S-locus (supple- 
mentary fig. SI, Supplementary Material online), resulting in a 
conservative test. Software to perform these computations 
(MScalc) is available on request to X. Vekemans. Shared poly- 
morphism with A. thaliana at B80 and ARK3 loci 
(Charlesworth et al. 2006) could bias the assignment of poly- 
morphic sites to the different classes, because the use of a 
single outgroup (A. thaliana) sequence can increase the 
number of polymorphisms inferred as exclusive (e.g., S x .| yr ) 
or putatively ancestral (e.g., S x .| yr f_h a i)< depending on which 
outgroup allele is sampled. Hence, for B80 and ARK3, we used 
additional information on polymorphism in A. thaliana 
(Tsuchimatsu et al. 2010) to exclude such sites from the 
analysis. 

Two statistics estimating intragenic recombination were 
computed using the PAIRWISE program in the LDhat 2.1 
package (http://ldhat.sourceforge.net/, accessed November 
13, 2012): R min , the minimum number of recombination 
events (Hudson and Kaplan 1985) and p, the population 
recombination rate (= 4Nr, with N the effective population 
size and r the recombination rate per nucleotide site), com- 
puted using a composite-likelihood approach (McVean et al. 
2002). To test the null hypothesis of no recombination 
(p = 0), we used the likelihood permutation test option of 
LDhat. Computer simulations have shown that balancing 
selection does not affect the accuracy of recombination 
rate estimates by LDhat (Rich man et al. 2003). 

As the peak of increased synonymous polymorphism was 
found to be very narrow around the S-locus (see Results), 
we used a sliding window approach to analyze the variation 
in 0 W and tt within the two genes B80 and ARK3 immediately 
flanking the S-locus region. We used windows of 30 bp moved 
by steps of 15 sites and calculated Spearman's rank correl- 
ation between the estimated synonymous site diversity in 
each window and its position in the sequenced fragment. 
To assess the statistical significance of the correlation, we 
obtained its null distribution by 10,000 random permutations 
between positions. 

The observed shape of the peak of polymorphism was then 
compared with expected patterns under partial linkage to a 
site under balancing selection, assuming a constant recom- 
bination rate per nucleotide site (c) in the region flanking the 
S-locus and no recombination within the S-locus. Schierup 
et al. (2001) showed, in models of gametophytic SI and over- 
dominant selection, that the rate of decrease in polymorph- 
ism on both sides of a peak centered on the selected locus 
does not depend on the strength of balancing selection. In the 
absence of any equivalent model for sporophytic SI, we used 
analytical predictions for a peak of nucleotide diversity asso- 
ciated with overdominant selection (Takahata and Satta 
1998). We used three different values of the selection coeffi- 
cient (s = 0.1, 0.5, or 0.9) to illustrate the effect of selection 
strength. Expectations for the resulting peak of neutral diver- 
sity were represented by plotting the expected coalescence 
times between randomly chosen pairs of neutral alleles at a 
neutral site partially linked to a site subject to overdominant 
selection, expressed in units of 2N generations (so that a 
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value > 1 indicates an excess of coalescence time over the 
neutral expectation at unlinked sites), as a function of the 
population recombination rate (p-4Nr, where r is the re- 
combination rate between the selected and the neutral site 
per generation). Observed patterns of the peak of neutral 
diversity in A. halleri were represented by plotting the ratio 
of average 6^ values observed in a given distance interval 
from the S-locus (supplementary fig. S2, Supplementary 
Material online) over the 9 W value observed in the genomic 
background (0 W = 0.0218; obtained from Roux et al. [2011]), 
as a function of p. Values for p were computed with 
N- 80,000 individuals (Roux et al. 2011) and using r values 
obtained by multiplying the distance from the S-locus in nu- 
cleotides with the chosen value of c. This procedure allowed 
us to visually identify the value of c that produces the best fit 
of the observed peak of polymorphism to the analytical pre- 
diction and hence obtain a rough estimate of the local re- 
combination rate. 

Coalescent Simulations 

The observed high levels of polymorphism at flanking genes 
in the S-locus region may represent the indirect effect of 
selection, leading to linkage disequilibrium between flanking 
nonselected variants and variants in the S-loci, or a local in- 
crease in mutation rate (Hudson et al. 1987). To distinguish 
the effects of neutral processes (mutation rate differences 
between loci and genetic drift) from those of indirect selec- 
tion in the S-locus region, we compared, for each flanking 
gene, observed values of summary statistics (Watterson's 6 W , 
Tajima's n, and the numbers of sites in each class of poly- 
morphism described earlier) to expected distributions under 
neutrality (supplementary fig. S3, Supplementary Material 
online). These expected distributions were obtained by co- 
alescent simulations assuming a four-parameter demographic 
model of divergence without gene flow between A. halleri and 
A. lyrata (Roux et al. 201 1). Using model-checking procedures 
under the ABC framework, the model of divergence was 
found that best accounts for data on synonymous poly- 
morphism in both species from 29 reference loci unlinked 
to the S-locus region (Roux et al. 2011). Estimates of the 
four parameters of the divergence model (the current effect- 
ive population sizes of A. halleri and A. lyrata (N ha | and N| yr ), 
the effective population size of the common ancestor (N anc ), 
and the time of the split between A. halleri and A. lyrata 
(T sp | it )) were then inferred using ABC and checked with good- 
ness of fit tests (Roux et al. 201 1). Here, to obtain ranges of the 
summary statistics investigated and provide conservative 
tests for these loci, we performed simulations using the full 
joint-posterior distribution of 2,000 parameter combinations 
previously obtained by ABC, rather than the modes of the 
posterior distributions. These simulations used the MSnsam 
program (Hudson 2002; Ross-Ibarra et al. 2008) with recom- 
bination. A total of 10,000 replicates were simulated for each 
flanking locus, by resampling with replacement from the par- 
ameter combinations. We calibrated the mutation rate 
/x = K/2T, where K is the measured synonymous divergence 
from A. thaliana for each flanking locus, assuming a 



divergence time T- 2.5 million generations and a generation 
time of 2 years (Koch and Matschinger 2007). Although this 
may be an underestimate of the divergence time between A. 
thaliana and A. lyrata (Beilstein et al. 2010), its value does not 
influence the expected distributions of summary statistics 
obtained by simulations, as all time estimates and mutation 
rates used in the simulations are relative values. We used the 
population recombination rates p described in the previous 
section. For each summary statistic, P values were estimated 
as the proportion of simulated values higher than the 
observed values. 

Forward Simulations of a Neutral Locus Partially 
Linked to a Balanced Polymorphism 
To validate our approach to detect excess polymorphism at 
flanking genes, we applied the same approach to simulated 
data sets of a neutral locus partially linked to a balanced 
polymorphism with overdominant selection in an isolation- 
with-divergence speciation model. These data sets were gen- 
erated by a simuPOP (Peng and Kimmel 2005) script 
Overdom.py that models evolution forwards in time. The 
neutral flanking locus was assumed to be 2,000 bp long, 
and its population mutation rate per locus (9 = 4N A /z, with 
N A the number of diploid individuals in the ancestral 
population) was 0.004 per generation to limit coincident mu- 
tations. For the selected locus, we modeled symmetric over- 
dominance with different values of the selection coefficient s 
against the homozygotes. Our mutation model allows 20 
functionally different alleles, with a mutation rate to a new, 
functionally different allele, of 0.0001 per generation and an 
initial number of 10 alleles. We first modeled evolution in a 
single ancestral population of effective size N A for 8N A 
generations, until mutation-selection-drift equilibrium was 
reached. Speciation was then assumed to occur, and the 
ancestral population was split into two populations with 
effective sizes N n and N 2 , which then evolved in isolation 
for a further T sp | it generations. For each replicate simula- 
tion, we then sampled alleles at the neutral locus from 
30 diploid individuals from each daughter population. On 
the basis of these simulated data sets, we then performed 
a full analysis like that applied to the sequence data, 
including the ABC and subsequent analysis, with the follow- 
ing six steps: 

1) For each replicate i of the validation process, we ran- 
domly sampled a set 4>, of values of the four demo- 
graphic parameters N A , N v N 2/ and T sp | it . We first 
sampled the number of diploid individuals N A from a 
uniform distribution [100-5,000]. and N 2 were then 
randomly and independently chosen in the interval 
(100-2N A ). T sp | ic was chosen from a uniform distribution 
(0.1-8). N„ where N x is the smaller of {N x N 2 ). 

2) We simulated 30 neutral loci in the isolation-with- 
divergence speciation scenario parameterized by <P, 
using the coalescent simulation software ms (Hudson 
2002) and sampled 60 haplotypes in both daughter 
populations. 
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3) From the multilocus data set simulated in step 2 for 
replicate /', we estimated a joint posterior distribu- 
tion of 2,000 parameter values by performing an ABC 
analysis following the procedure described in Roux et al. 
(2011). 

4) We found neutral expectations for the diversity statistics 
(jt, 9, the number of exclusive polymorphic sites S x , and 
the number of shared polymorphic sites S s ) by running 
10,000 coalescent single-locus simulations under the di- 
vergence model, with parameter values sampled from 
the joint-posterior distribution obtained in step 3 for 
replicate /'. A total of 60 haplotypes of the neutral locus 
were sampled from each population. 

5) We then simulated a neutral locus partially linked to a 
balanced polymorphism using Overdom.py, with 
demographic parameters <P, and three different recom- 
bination distances between the neutral locus and the 
selected target: p = 4N A , r = 0.0001, p = 0.001, and 
r = 0.5, where r is the rate of recombination per 



gamete per generation between the two loci. We used 
the same length, mutation rate, and sample size as in 
step 4. 

6) The computed statistics for the partially linked neutral 
locus in step 5 were compared with neutral expectations 
computed in step 4 by computing P values. 

Results 

Simulations to Validate Our Approach to Detecting 
Excess Polymorphism in Genome Regions Flanking a 
Balanced Polymorphism 

From 38 independent iterations of the six-step procedure just 
described, we tested the simulated values of several diversity 
statistics against the neutral expectations under the same 
population history. The results are presented in figure 2 for 
three levels of linkage to the selected locus: highly linked 
(p = 0.0001), moderately linked (p = 0.001), and unlinked 
(p = 0.5). As expected, the unlinked case yielded values of 
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Fic. 2. Results of the validation procedure for the approach used to detect excess polymorphism in a neutral locus partially linked to a balanced 
polymorphism. Red and pink areas, respectively, represent the 95% and 99% limits of the expected neutral distributions obtained by coalescent 
simulations from ABC's joint posteriors for: (A) Watterson's 0 W per bp, (B) it per bp, (C) the number of shared, and (D) exclusive polymorphic sites. The 
x axis represents the different replicates of the validation procedure with different demographic parameters, sorted in ascending order of the 99.5% 
quantile. White, gray, and black dots represent observed values of the statistics computed at the neutral locus at increasing recombination distance 
(,0 = 0.5, p = 0.001, and p = 0.0001, respectively). The dotted lines represent the thresholds above which any value of the statistic is considered as a 
deviation from neutrality. 
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the polymorphism statistics (nucleotide diversity n, 
Watterson's 9 W , proportion of exclusive polymorphisms S x , 
and proportion of shared polymorphisms S s ) that were 
mostly within the neutral expectation boundaries. In contrast, 
for the highly linked case, significantly higher values were 
common, except for the S x statistic, for which only about 
half of the values obtained were significant. The moderately 
linked case gave intermediate values, but most were signifi- 
cantly higher than the neutral expectations. This shows that 
our approach can potentially detect the indirect effects of a 
locus under balancing selection. 
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A Narrow Signature of Balancing Selection around the 
S- Locus 

Table 1 lists the previously published and newly sequenced 
loci and their distances from the nonrecombining S-locus 
genomic region. The physical distance from the S-locus 
strongly affects the extent of departure from neutral expect- 
ations in both A. halleri and A. lyrata. Eight of the loci most 
distant from the S-locus (At4g21170, At4g21200, At4g21230, 
At4g21323, B160, At4g21440, At4g27550, and At4g21620) con- 
sistently had diversity within the range of values expected 
under the neutral hypothesis without indirect selection, 
whereas the three genes closest to the S-locus (B80, ARK3, 
and B720) had high synonymous diversity and differed signifi- 
cantly from neutral expectations for both Watterson's 9 W 
and nucleotide diversity it (table 2, fig. 3, and supplementary 
table S1, Supplementary Material online). Unexpectedly, one 
of the distant flanking region loci, At4g21480, which is 
approximately 60 Kb downstream from the S-locus showed 
a significant excess of polymorphism for both diversity esti- 
mators in A. halleri with four highly diverged haplogroups, 
each consisting of a set of closely similar haplotypes. One of 
these sets is similar to the A. thaliana reference sequence 
(supplementary fig. S4, Supplementary Material online). In 
A. lyrata, that haplogroup predominated, but a single se- 
quence (from Norway) was found to belong to another of 
these A. halleri haplogroups (supplementary fig. S4, 
Supplementary Material online). The nucleotide diversity at 
this locus in A. lyrata is thus low and nonsignificantly different 
from the neutral expectation, whereas Watterson's 9 W 
showed a significant excess of polymorphism. In addition, 
four A. halleri At4g21480 haplotypes in one of the hap- 
logroups appear nonfunctional, with a premature stop 
codon (as a consequence of a single base pair insertion). 
This haplotype was found in three A. halleri populations 
(from Slovenia, Poland, and the Czech Republic); it also has 
a 36-bp deletion elsewhere in the sequence. 

Sliding window analysis of synonymous diversity across the 
sequenced fragment of the B80 gene directly adjacent to the 
S-locus revealed a significant negative correlation between 
distance from the S-locus and both 0 W (Spearman's 
r= -0.428; P = 0.0025) and n (Spearman's r= -0.300; 
P = 0.0248) in A. halleri (fig. 4). This further supports a 
sharp decline of polymorphism at very short distances from 
the S-locus. However, no correlation was found in A. lyrata 
(for 6» w : Spearman's r= -0.428; P = 0.4415 and for 
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Fig. 3. Histograms of the observed synonymous diversity measures per base pair for the 12 S-locus flanking genes in Arabidopsis halleri (A and B) and A. 
lyrata (C and D) and their expected distribution obtained from coalescent simulations under the null hypothesis of no linkage to selected sites. 
Synonymous polymorphism was estimated by both the nucleotide diversity jt (A and C) and Watterson's 6 W (B and D). Shaded boxes indicate the 
one-tailed 95% confidence intervals for each locus, obtained by simulations. Solid dots indicate the observed values. Thick horizontal lines indicate the 
median of each distribution. 
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Fig. 4. Sliding window analysis of synonymous polymorphism measured 
by t9 w (expressed per bp) within the B80 flanking gene in (A) Arabidopsis 
halleri and (B) A. lyrata. The x axis is the nucleotide position (in bp) 
relative to the closest nucleotide from the S-locus in the sequenced 
fragment. Regression line is only shown for A. halleri (R = — 0.4340) 
because the estimated measure of association was significant after a 
permutation test (P = 0.0025), while not for A. lyrata (R = 0.1 113). 

it: r - -0.300; P = 0.6122), and none was found at the ARK3 
gene in either species. 

The Excess of Polymorphism Has Two Distinct 
Origins 

The high synonymous nucleotide polymorphism at ARK3, 
B80, and B720 in both species are associated with significantly 



more shared polymorphic sites (S s ), suggesting an excess of 
ancestral polymorphism compared with neutral expectations 
(fig. 5 and table 2). Specifically, 5 shared synonymous site 
polymorphisms were observed in the ARK3 gene 
(P = 0.0258), 31 in B80 (P < 0.0001) and 9 in B120 
(P = 0.0046). Sites with a derived allele polymorphic in one 
species but fixed in the other species (another type of puta- 
tive ancestral polymorphism) were also in excess at ARK3 
in A. lyrata (S x . ha | f_i yr =3, P- 0.0256) but not in A. halleri 

(Sx-half-lyr=1,P = 0.1824). 

The maintenance of ancestral polymorphisms was not, 
however, the only cause of elevated polymorphism in these 
three genes. We also found an excess of derived exclusive 
polymorphisms in B80 in both A. halleri (S x _h a i = 13, 
P = 0.0255) and A. lyrata (S x _, yr = 23, P< 0.0001), in ARK3 in 
A. lyrata (S x .| yr = 10, P = 0.013) but not in A. halleri (S x . ha i = 9, 
P = 0.0598), and in B720 in A. halleri (S x . hal = 12, P = 0.01 13) 
but not A. lyrata (S x .| yr = 12, P- 0.077). As explained earlier, 
the polymorphisms shared between A. lyrata and A. thaliana 
previously reported in the B80 and AR/G loci (Charlesworth 
et al. 2006) could bias the ascertainment of mutation types, 
because when true ancestral polymorphisms segregate in the 
outgroup species, the use of a single outgroup sequence can 
increase the number of polymorphic sites inferred to be 
derived, depending on which outgroup allele was sampled. 
Taking polymorphism in A. thaliana into account for B80 and 
ARK3 indeed reduces the numbers of derived polymorphic 



442 



Recent and Ancient Signature of Balancing Selection ■ doi:10.1093/molbev/mss246 



MBE 




CO 



0.21 
0.14 - 
0.07 
0 - 1 



0.21 



>■ 0.14 



CO 



t — OJ OJ CO CO 

CM fd OJ OJ 
O) D3 Ui O) 



CO 

en 
< 



CO lO OJ 



in m >- T- ■- T- 

OJ OJ OJ CM 



■"J 
< 



< 



O) U> O) O) 



< 



< 



t — OJ OJ CO CQ 

OJ CM C\J OJ 
D) D) D3 O) 



CO 

cc 
< 



CO LD CM 



< 



D ID t- t- r r 

OJ CM OJ OJ 

O) O) D3 D3 

< < < < 




Fic. 5. Distribution of the proportions of synonymous polymorphic sites observed per nucleotide in the 12 S-locus flanking genes, according to the six 
categories of polymorphism: halkrj and lyrata are the proportion of exclusive derived polymorphic sites in Arabidopsis halleri and A. lyrata, 
respectively; S x . ha | f .| yr and S x .| yr f , ha | are the proportion of polymorphic sites with a derived allele fixed in one species but still in segregation in the other; 
"Fixed differences" is the number of positions with different fixed alleles in A. halleri and A. lyrata; and S s is the proportion of shared polymorphic sites in 
both species. Shaded boxes indicate the one-tailed 95% confidence intervals for each locus, obtained by simulations under the null hypothesis of no 
linkage to selected sites. Solid dots indicate the observed values. Thick horizontal lines indicate the median for each distribution. 



sites, but there is still a significant excess for both genes in 
A. lyrata (supplementary tables S2 and S3, Supplementary 
Material online). 

Recombination Rates in Genes in the Region Flanking 
the S-Locus 

The local recombination rates in the regions flanking the 
S-locus are poorly known, whereas the S-locus itself is essen- 
tially nonrecombining (see Introduction). Our results show 
substantial evidence for intralocus recombination in ARK3, 
B80, and B120 in both A. halleri and A. lyrata (table 3). 
The population recombination rates per nucleotide (p) in 
A. halleri were 0.068, 0.1397, and 0.0346 for ARK3, B80, and 



B120, respectively, which are somewhat higher than the aver- 
age for 29 genes unlinked to the S-locus in A. halleri 
(mean = 0.020; (Roux et al. 2011). Recombination estimates 
were slightly lower in A. lyrata than in A. halleri, with 
p = 0.0136, 0.0837, and 0.0074 for ARK3, B80, and B720, 
respectively. 

Assuming that the observed decrease in polymorphism 
across B80 with distance from the S-locus reflects the effect 
of recombination, we estimated the local recombination rate 
in the S-locus flanking region using a different approach. This 
is based on comparing observed and expected diversity peaks 
for neutral sites partially linked to a locus subject to balancing 
selection. The fit of the observed to the expected peak de- 
pends mainly on the value assumed for the local 
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recombination rate in the flanking region, whereas different 
strengths of selection at the selected locus (overdominant 
selection with coefficient s) have only minor effects (fig. 6). 
The best fit was obtained assuming r x 1 cM/Mb, which is 
somewhat lower than the rate of recombination estimated 
for the genomic background in A. halleri (6.25 cM/Mb; Roux, 

Table 3. Intralocus Recombination Rates Measured by Rmin per 
Locus and p per bp. 

Locus Species R min p = 4Nr P* 

At4g21170 Arabidopsis halleri 5 0.0323 0 

A. lyrata 4 0.0148 0.004 

At4g21200 A. halleri 0 0.0791 0.333 

A. lyrata 0 0.048 0.7 

At4g21230 A. halleri 3 0.0296 0.013 

A. lyrata 2 0.0374 0.276 

At4g2T323 A. halleri 6 0.0103 0.126 

A. lyrata 0 0 0.558 

B80 A. halleri 27 0.1397 0 

A. lyrata 18 0.0837 0 

ARK3 A. halleri 4 0.068 0 

A. lyrata 4 0.0136 0.02 

BT20 A. fiaHeri 4 0.0346 0 

A. /yrata 2 0.0074 0 

B160 A. fiaHeri 5 0.0094 0 

A. /yrata 0 0 0.05 

At4g2T440 A. fia//eri 5 0.0563 0.103 

A. /yrata 8 0.0689 0 

At4g21480 A. fia/teri 7 0 0.975 

A. /yrata 1 0 0.136 

At4g21550 A. halleri 3 0.0463 0 

A. lyrata 0 0 0.622 

At4g21620 A. fia/teri 0 0.0346 0 

A. /yrata 0 0.0192 0.105 

*P value of the test for detecting intragenic recombination using a likelihood per- 
mutation procedure performed with LDhat. 



unpublished) or the overall estimate obtained for the 
4Mb-wide region flanking the S-locus (3.8 cM/Mb 
[Ruggiero et al. 2008]). Overall, our results suggest that re- 
combination occurs in the regions immediately flanking the 
S-locus. 

Discussion 

Improved Resolution of the Diversity Peak around the 
S-Locus 

Our new test, using physical distance information between 
flanking genes and the S-locus, and also an explicit demo- 
graphic model of divergence between A. lyrata and A. halleri, 
confirms previous observations that the three closest genes 
flanking the S-locus exhibit a signature of increased poly- 
morphism due to linkage to the selected locus. This highly 
localized signature is in agreement with theoretical studies 
predicting very localized effects of long-term balancing selec- 
tion on neutral linked regions. Normal recombination rates 
are expected to be sufficient to restrict the elevation of neu- 
tral polymorphism to a small region of just a few hundred 
base pairs from the putatively selected site within the 
Drosophila melanogaster ADH locus as described in Hudson 
and Kaplan (1988). As shown by Schierup, Mikkelsen, et al. 
(2001), longer gene genealogies in regions flanking loci linked 
to a site under balancing selection increase the local effective 
recombination rate. Our results confirm that, despite the 
strong balancing selection operating at the S-locus, the ex- 
pected elevation of nucleotide diversity due to indirect selec- 
tion affects only the immediately flanking genes. On one side 
of the S-locus, ARK3 and B720, within 4.24 kb of the boundary 
of the nonhomologous region, have increased polymorphism, 
but the next locus sequenced, B160, at 28.04 kb, does not. On 
the other side, high synonymous diversity was estimated at 
B80 in both A. halleri and A. lyrata but not at At4g21323, 1 5 kb 



At4g21170 

At4g21200 

At4g21230 

At4g21323 

B80 

ARK3 

B120 

B160 

At4g21440 
At4g21480 
At4g215S0 
At4g21620 



Arabidopsis halleri 


5 


0.0323 


A. lyrata 




u.u llo 


A. halleri 


0 


0.0791 


A. lyrata 


U 


u.u*#o 


A. halleri 


3 


0.0296 


A. lyrata 


-> 

L 




A. halleri 


6 


0.0103 


A. lyrata 


0 


0 


A. halleri 


27 


0.1397 


A. lyrata 


18 


0.0837 


A. halleri 


4 


0.068 


A. lyrata 


4 


0.0136 


A. halleri 


4 


0.0346 


A. lyrata 


2 


0.0074 


A. halleri 


5 


0.0094 


A. lyrata 


0 


0 


A. halleri 


5 


0.0563 


A. lyrata 


8 


0.0689 


A. halleri 


7 


0 


A. lyrata 


1 


0 


A. halleri 


3 


0.0463 


A. lyrata 


0 


0 


A. halleri 


0 


0.0346 


A. lyrata 


0 


0.0192 



10 




• Overdominance s = 0.] 
Overdominance s = 0.5 
■Overdominance s = 0.9 
-Obs B80 r = 1 cM/Mb 
-Obs B80 r = 0.041 cM/Mb 
-Obs B80r=6.25cM/Mb 



0.01 



0.1 




10 



CD 



CD 



1 

p = 4A/r 



10 



100 



Fig. 6. Comparison between observed versus expected patterns of diversity peaks for neutral sites partially linked to a locus subject to balancing 
selection. Dotted and interrupted lines represent expected values under overdominant selection with different selection intensities. Continuous lines 
represent observed values for the B80 flanking gene in Arabidopsis halleri with different values assumed for the per nucleotide recombination rate in the 
flanking region. 



444 



Recent and Ancient Signature of Balancing Selection ■ doi:10.1093/molbev/mss246 



MBE 



upstream from B80. This is consistent with results from 
Kamau and Charlesworth (2005), who found low diversity 
in A. lyrata at the B70 gene (At4g27340) flanking B80 at 
approximately 5 kb. 

Moreover, a rapid decline in nucleotide diversity occurs 
even within the B80 gene in A. halleri, indicating recombin- 
ation within B80. The diversity pattern yielded an estimated 
recombination rate in the S-locus flanking region of 1 cM/Mb, 
only slightly lower than the genomic average (6.25 cM/Mb; 
Roux, unpublished). No such decrease in diversity was 
observed in B80 in A. lyrata, however, possibly indicating 
that recombination in this region is lower in this species. 
This would be consistent with the lower estimate of recom- 
bination obtained with LDhat in this study or from other 
direct or indirect approaches in previous studies (Kamau 
and Charlesworth 2005; Hansson et al. 2006; Kawabe et al. 
2006). However, the sample previously studied from this spe- 
cies was geographically limited (only populations from 
Iceland were used), which could produce higher LD and 
thus lower recombination rate estimates. Indeed, Ross- 
Ibarra et al. (2008) reported that estimates of the overall 
population recombination rate in a population from 
Iceland were much lower than estimates obtained from the 
Plech reference population from Germany, which had been 
identified as part of the center of diversity of A. lyrata. Further 
work using better sampling should clarify this issue. 

In contrast to studies focusing on detecting indirect effects 
of selection associated with selective sweeps, few empirical 
studies have focused on the indirect signature of balancing 
selection in different species. In A. thaliana, a plant species 
with a low effective recombination rate, no elevation of poly- 
morphism was detectable in neutral regions 10 kb away from 
RPS5, a gene involved in pathogen resistance, and thought to 
be evolving under strong balancing selection (Tian et al. 
2002). The timescale of this balanced polymorphism may 
be shorter than that at the S-locus, but linkage disequilibrium 
in the highly selfing A. thaliana extends over larger distances 
than in the outcrossing A. lyrata and A. halleri (Hu et al. 201 1). 
In the human genome, surveys searching for signatures of 
indirect selection around loci under balancing selection 
have also shown rather narrow peaks, seriously limiting the 
ability of this approach to identify targets of balancing selec- 
tion across the genome (Akey et al. 2002, 2004; Bubb et al. 
2006; Andres et al. 2009). 

An apparent exception to the narrow diversity peak is the 
excess of polymorphism detected at the At4g21480 gene, 
which is located 60 kb away from the S-locus. This does not 
seem to be due to linkage to the S-locus, as another gene 
(At4g21440) located closer to the S-locus on the same side 
shows no sign of a departure from neutrality, but it might be 
due to a distinct balancing selection process unrelated to SI 
that maintains both functional and nonfunctional alleles 
(see Results). The At4g21480 gene encodes sugar transporter 
protein 12 in A. thaliana, which is highly expressed during 
formation of nematode-induced root syncytia (Hofmann 
et al. 2009). Inactivation of the gene through T-DNA insertion 
in transformed A. thaliana individuals reduced infection rate 
by male individuals of the nematode Heterodera schachtii. 



Possibly, the signature of balancing selection at this gene is 
caused by long-term host-parasite interactions. 

Accumulation of Deleterious Mutations in Linkage 
Disequilibrium with S-Alleles 

The very restricted signature of linkage disequilibrium around 
the S-locus makes the genetic basis of the genetic load asso- 
ciated with different S-locus alleles (Bechsgaard et al. 2004; 
Llaurens et al. 2009) very puzzling. Such results are generally 
interpreted as resulting from the accumulation of wholly or 
partially recessive deleterious mutations in the genomic 
neighborhood of the S-locus (Uyenoyama 1997). However, 
the nonrecombining region contains only the two SI genes 
(SRK and SCR), and recombination clearly starts again imme- 
diately after the two flanking genes B80 and ARK3 (Goubet 
et al. 2012). Outside this region, our calculations suggest that 
at most nine coding sequences could be influenced by linkage 
to the S-locus, based on the detected effects on silent diversity 
(supplementary table S4, Supplementary Material online). 
These nine genes are therefore particularly relevant candi- 
dates to investigate the genetic basis of the sheltered load 
within natural populations. An interesting alternative possi- 
bility is that the sheltered load could be associated with non- 
coding loci such as small RNAs, which are abundant in the 
S-locus region in association with a high density of transpos- 
able elements (Goubet et al. 2012). 

Distinguishing between Recent and Ancient 
Signatures of Balancing Selection 
In plant SI, negative frequency-dependent selection has been 
clearly identified as the long-term selective agent responsible 
for the maintenance of S-locus polymorphism (Wright 1939; 
Charlesworth 2006; Hagenblad et al. 2006). Interestingly, we 
found both recent and ancient signatures of indirect effects 
of selection on the S-locus flanking genes (fig. 5), which is 
in agreement with the fact that SI is known to be ances- 
tral in the genus Arabidopsis (Bechsgaard et al. 2006; 
Castric et al. 2008) and is still functional in A. lyrata and 
A. halleri (Schierup, Mable, et al. 2001; Castric and Vekemans 
2007). 

Our approach, involving testing separately for an excess of 
recently derived polymorphisms and for an excess of ancestral 
polymorphisms, could be useful in other systems to distin- 
guish cases where balancing selection occurred long ago (but 
may no longer be acting) or situations where it started to act 
only recently (newly evolved balanced polymorphisms). For 
instance, in the vertebrate, major histocompatibility system, 
the selective force (parasite-mediated overdominance, nega- 
tive frequency-dependent selection, sexual selection, or mate 
choice), and the putative role of linked deleterious alleles 
remain highly debated (Bernatchez and Landry 2003; 
Piertney and Oliver 2006; van Oosterhout 2009). Obtaining 
information on the timescale of the selection pressure would 
potentially help testing alternative mechanisms (Garrigan and 
Hedrick 2003). 
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Supplementary Material 

Supplementary figures S1-S4 and tables S1-S4 are available at 
Molecular Biology and Evolution online (http://www.mbe. 
oxfordjournals.org/). 
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